refactor: cross-platform foundation for macOS support #89

Merged
rgarcia merged 5 commits into main from refactor/cross-platform-foundation
Feb 11, 2026
Conversation

@rgarcia
Contributor

@rgarcia rgarcia commented Feb 10, 2026

Summary

  • Split platform-specific code into _linux.go and _darwin.go files across resources, network, devices, ingress, vmm, and vm_metrics packages
  • Add hypervisor abstraction with registration pattern (RegisterSocketName, RegisterVsockDialerFactory, RegisterClientFactory) to decouple instance management from specific hypervisor implementations
  • Add "vz" to the OpenAPI hypervisor type enum, erofs disk format support, and insecure registry option for builds
  • Cross-compile guest-agent and init binaries for Linux (needed when building on macOS)

No behavioral changes on Linux. macOS can now compile but has no VM functionality yet.

This is part 1 of 2 for macOS/Virtualization.framework support. Part 2 (feat/vz-hypervisor) adds the actual vz hypervisor implementation.
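
As a rough illustration of the registration pattern named in the summary, the sketch below shows how a factory registry can keep instance management free of direct references to any one hypervisor. Only the name RegisterClientFactory comes from this PR; its signature here, and the HypervisorType, Client, ClientFactory, NewClient, and stubClient names, are assumptions made for the example and may not match the real code.

```go
package main

import (
	"fmt"
	"sync"
)

// HypervisorType is a hypothetical stand-in for the OpenAPI hypervisor enum.
type HypervisorType string

// Client is a hypothetical minimal interface to a running VM's control socket.
type Client interface {
	Name() string
}

// ClientFactory builds a Client for a control-socket path (assumed signature).
type ClientFactory func(socketPath string) (Client, error)

var (
	mu        sync.RWMutex
	factories = map[HypervisorType]ClientFactory{}
)

// RegisterClientFactory records a factory for one hypervisor type. A
// platform-specific file could call this from init().
func RegisterClientFactory(t HypervisorType, f ClientFactory) {
	mu.Lock()
	defer mu.Unlock()
	factories[t] = f
}

// NewClient looks up the registered factory; unregistered types return an error.
func NewClient(t HypervisorType, socketPath string) (Client, error) {
	mu.RLock()
	f, ok := factories[t]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no client factory registered for hypervisor %q", t)
	}
	return f(socketPath)
}

// stubClient is a placeholder implementation used only by this sketch.
type stubClient struct{ name string }

func (c stubClient) Name() string { return c.name }

func main() {
	RegisterClientFactory("vz", func(socketPath string) (Client, error) {
		return stubClient{name: "vz@" + socketPath}, nil
	})
	c, err := NewClient("vz", "/tmp/vz.sock")
	fmt.Println(c.Name(), err)

	_, err = NewClient("unregistered", "/tmp/other.sock")
	fmt.Println(err)
}
```

With this shape, a build that excludes a hypervisor's package never registers its factory, so a request for that hypervisor fails with a clear error instead of a link-time dependency.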

Test plan

  • make build-linux succeeds
  • make test-linux passes apart from failures caused by Docker Hub rate limiting, which are unrelated to the code changes
  • CI passes on Linux runner
  • CI passes on macOS runner (compilation check via test-darwin)

🤖 Generated with Claude Code


Note

Medium Risk
Touches VM lifecycle/build pipeline (vsock connection path, builder image preparation, log persistence) and image/disk creation logic, which could affect build reliability and instance startup even on Linux despite being largely additive/guarded.

Overview
Establishes a cross-platform foundation by splitting multiple subsystems into Linux vs macOS implementations (networking, devices/GPU passthrough, resources, VM metrics, VMM/ingress binaries), adding macOS stubs/no-ops where features aren’t supported yet, and updating the API hypervisor enum to include vz.

Refactors hypervisor integration to be registration/factory-based (RegisterClientFactory, platform-specific starters, per-instance GetVsockDialer) and adds platform-specific hypervisor access checks (KVM on Linux vs Virtualization.framework on Apple Silicon).

Improves the build system: builder-agent now streams logs over vsock while retaining an authoritative buffered log; build manager can bootstrap a builder image locally from an embedded Dockerfile via Docker + OCI cache import when BUILDER_IMAGE is unset, gates build execution until the builder image is ready, and switches to per-instance vsock dialing. Image handling is adjusted to always target Linux VM platform for pulls/manifests and disk image creation is made 4K-sector aligned for Virtualization.framework compatibility.

Written by Cursor Bugbot for commit 964eacb.

Split platform-specific code into _linux.go and _darwin.go files across
resources, network, devices, ingress, vmm, and vm_metrics packages.
Add hypervisor abstraction with registration pattern (RegisterSocketName,
RegisterVsockDialerFactory, RegisterClientFactory) to decouple instance
management from specific hypervisor implementations. Add "vz" to the
OpenAPI hypervisor type enum, erofs disk format support, and insecure
registry option for builds.

No behavioral changes on Linux. macOS can now compile but has no VM
functionality yet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions bot commented Feb 10, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

refactor: cross-platform foundation for macOS support
⚠️ hypeman-openapi studio · code

There was a regression in your SDK.
generate ⚠️

⚠️ hypeman-typescript studio · code

There was a regression in your SDK.
generate ❗ build ✅ lint ✅ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/da5e5532f852ace9a21fb273fd2d73f8b643db2e/dist.tar.gz
⚠️ hypeman-go studio · code

There was a regression in your SDK.
generate ⚠️ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@8adc4f38026abee34ad85c15509e90f47644a0d0
⚠️ hypeman-cli studio · conflict

There was a regression in your SDK.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-02-11 01:11:19 UTC

@rgarcia rgarcia requested a review from sjmiller609 February 10, 2026 21:43
- Restore persisting result.Logs to disk after build completion, since
  streamed log lines can be dropped when the bounded channel overflows
- Add docker tag + push after building the builder image locally so it
  is available in the registry for builder VMs to pull
- Synchronize log streaming with build result delivery by waiting for
  logsDone channel before sending build_result, preventing the host
  from closing the connection before all logs are delivered

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
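
The interaction between the bounded channel, the authoritative log, and the logsDone wait can be sketched roughly as follows. This is not the builder-agent code: the logStreamer type, its fields, and the in-memory "sent" slice (standing in for the vsock connection) are hypothetical. Only the ideas of a bounded channel that may drop lines, a full log kept for persistence, and waiting on logsDone before delivering the result come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// logStreamer is a hypothetical type: it keeps a complete log while also
// streaming lines best-effort through a bounded channel.
type logStreamer struct {
	full     strings.Builder // authoritative copy; never drops lines
	lines    chan string     // bounded; sends are non-blocking
	logsDone chan struct{}   // closed once the streaming goroutine has drained
	sent     []string        // stands in for lines written to the vsock
}

func newLogStreamer(capacity int) *logStreamer {
	s := &logStreamer{
		lines:    make(chan string, capacity),
		logsDone: make(chan struct{}),
	}
	go func() {
		defer close(s.logsDone)
		for l := range s.lines {
			s.sent = append(s.sent, l)
		}
	}()
	return s
}

// Write always records the line in the full log, and streams it only if the
// bounded channel has room, so a slow reader never blocks the build.
func (s *logStreamer) Write(line string) {
	s.full.WriteString(line + "\n")
	select {
	case s.lines <- line:
	default: // channel full: the streamed copy of this line is dropped
	}
}

// Finish stops streaming, waits for logsDone, and returns the full log. The
// caller would send the build result only after this returns.
func (s *logStreamer) Finish() string {
	close(s.lines)
	<-s.logsDone
	return s.full.String()
}

func main() {
	s := newLogStreamer(2)
	for i := 0; i < 50; i++ {
		s.Write(fmt.Sprintf("step %d", i))
	}
	full := s.Finish()
	fmt.Printf("streamed %d lines, full log has %d bytes\n", len(s.sent), len(full))
}
```

Because streamed lines can be dropped, the full log returned by Finish is the copy worth persisting to disk, which matches the first bullet of the commit message.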
Collaborator

@sjmiller609 sjmiller609 left a comment

looks good, maybe some opportunity for deduplication

Comment on lines +11 to +13
func detectCPUCapacity() (int64, error) {
	return int64(runtime.NumCPU()), nil
}
Collaborator

this looks simpler than the linux one, idk if we could do the same there or not

// Auto-detect from filesystem using statfs
var stat unix.Statfs_t
dataDir := cfg.DataDir
if err := unix.Statfs(dataDir, &stat); err != nil {
Collaborator

the implementations for darwin and linux look almost identical here

Rework builder image provisioning:
- Embed generic/Dockerfile with go:embed instead of reading from filesystem
- Remove hardcoded "hypeman/builder:latest" default; if BUILDER_IMAGE is
  unset, build from embedded Dockerfile (dev mode)
- After docker build+save, write image directly into OCI layout cache
  and call ImportLocalImage, bypassing docker push entirely
- Move RecoverPendingBuilds after ensureBuilderImage to prevent race
  where recovered builds fail with "builder image is being prepared"

Also fix review feedback:
- Add sector alignment to erofs disk conversion (macOS VF compat)
- Remove redundant ForLinux OCI client aliases and update callers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rgarcia
Contributor Author

rgarcia commented Feb 10, 2026

Addressed the remaining review feedback in 3658218:

Cursorbot bugs (all fixed):

  • RecoverPendingBuilds race — Moved RecoverPendingBuilds() from NewManager() to Start(), running it sequentially after ensureBuilderImage completes. Recovered builds no longer race against builder readiness.
  • Missing erofs sector alignment — Added the same alignToSector + truncate logic that ext4 already had to convertToErofs.
  • Redundant ForLinux aliases — Removed InspectManifestForLinux/PullAndUnpackForLinux and updated the single caller in initrd.go.
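
For readers unfamiliar with the sector-alignment fix, here is a minimal sketch of rounding a disk image up to a 4K boundary and padding it with truncate. The name alignToSector appears in the comment above, but its body here is a guess, and padToSector and the sectorSize constant are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"os"
)

// sectorSize is the 4K alignment unit assumed by this sketch.
const sectorSize = 4096

// alignToSector rounds size up to the next multiple of sectorSize.
func alignToSector(size int64) int64 {
	if rem := size % sectorSize; rem != 0 {
		return size + (sectorSize - rem)
	}
	return size
}

// padToSector grows the file at path (zero-filled) so its length is a
// multiple of sectorSize, and returns the resulting size.
func padToSector(path string) (int64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	aligned := alignToSector(info.Size())
	if aligned != info.Size() {
		if err := os.Truncate(path, aligned); err != nil {
			return 0, err
		}
	}
	return aligned, nil
}

func main() {
	f, err := os.CreateTemp("", "disk-*.img")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.Write(make([]byte, 5000)); err != nil {
		panic(err)
	}
	f.Close()

	size, err := padToSector(f.Name())
	fmt.Println(size, err)
}
```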

Steven's observations:

  • cpu_darwin.go simplicity — Good observation. Linux could potentially use runtime.NumCPU() too, but the /proc/cpuinfo parsing handles edge cases like hyperthreading topology (siblings × sockets) which runtime.NumCPU() may not match on certain server configurations. Worth revisiting as a follow-up but not blocking.
  • disk_darwin.go duplication — Agreed, the implementations are nearly identical. The difference is Darwin uses unix.Statfs_t with a / fallback while Linux uses syscall.Statfs_t without. These could be unified using golang.org/x/sys/unix on both platforms in a follow-up refactor.

Additionally, this commit includes the builder image rework:

  • Embedded the Dockerfile with go:embed instead of reading from filesystem
  • Removed the hardcoded "hypeman/builder:latest" default — if BUILDER_IMAGE is unset, builds from the embedded Dockerfile
  • After docker build + docker save, writes the image directly into the OCI layout cache and calls ImportLocalImage, bypassing docker push entirely

Add three tests verifying the builder image import pipeline:

- TestDockerSaveTarballToOCILayoutRoundtrip: full pipeline from
  docker save tarball → load → OCI layout → existsInLayout →
  extractMetadata → unpackLayers with rootfs verification

- TestDockerSaveToOCILayoutCacheHit: verifies pullAndExport skips
  remote pull when image exists in OCI layout cache (uses bogus
  registry URL that would fail if pull was attempted)

- TestImportLocalImageFromOCICache: end-to-end integration test
  simulating buildBuilderFromDockerfile's flow: write to OCI cache
  → ImportLocalImage → async build → verify GetImage metadata and
  GetDiskPath returns valid ext4 disk

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A Cloud Agent has been kicked off to fix the reported issue.

@cursor

cursor bot commented Feb 11, 2026

Bugbot Autofix prepared fixes for 1 of the 1 bugs found in the latest run.

  • ✅ Fixed: DockerSocket config not propagated to build manager
    • Added the missing DockerSocket: cfg.DockerSocket field to the builds.Config struct initialization in providers.go so the environment variable value is properly wired through.

Or push these changes by commenting:

@cursor push 262b30fcb7
Preview (262b30fcb7)
diff --git a/lib/providers/providers.go b/lib/providers/providers.go
--- a/lib/providers/providers.go
+++ b/lib/providers/providers.go
@@ -259,6 +259,7 @@
 		RegistryCACert:      registryCACert,
 		DefaultTimeout:      cfg.BuildTimeout,
 		RegistrySecret:      cfg.JwtSecret, // Use same secret for registry tokens
+		DockerSocket:        cfg.DockerSocket,
 	}
 
 	// Apply defaults if not set

Merge disk_darwin.go and disk_linux.go into disk.go since both
implementations are nearly identical — the only difference was
syscall.Statfs (linux) vs unix.Statfs (darwin). Use unix.Statfs
from golang.org/x/sys/unix which works on both platforms.

Addresses review feedback from @sjmiller609.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
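
A single statfs-based capacity probe that compiles on both Linux and macOS might look roughly like the sketch below. Note the substitution: the commit uses unix.Statfs from golang.org/x/sys/unix, while this example uses the standard library's syscall.Statfs so that it has no external dependency. The function name detectDiskCapacity and its return shape are assumptions, not the contents of disk.go.

```go
package main

import (
	"fmt"
	"syscall"
)

// detectDiskCapacity reports total and available bytes for the filesystem
// containing dataDir. Hypothetical helper; the real disk.go may differ.
func detectDiskCapacity(dataDir string) (uint64, uint64, error) {
	var stat syscall.Statfs_t
	if err := syscall.Statfs(dataDir, &stat); err != nil {
		return 0, 0, fmt.Errorf("statfs %s: %w", dataDir, err)
	}
	// Bsize has a different integer type on Linux and Darwin, so convert
	// explicitly before multiplying.
	bsize := uint64(stat.Bsize)
	total := stat.Blocks * bsize
	avail := stat.Bavail * bsize
	return total, avail, nil
}

func main() {
	total, avail, err := detectDiskCapacity("/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("total=%d bytes, available=%d bytes\n", total, avail)
}
```

The explicit uint64 conversion of Bsize is what lets one source file serve both platforms without build tags.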
@rgarcia rgarcia merged commit 5c29ba8 into main Feb 11, 2026
4 checks passed
@rgarcia rgarcia deleted the refactor/cross-platform-foundation branch February 11, 2026 01:09
rgarcia added a commit that referenced this pull request Feb 11, 2026
disk_darwin.go and disk_linux.go were unified into disk.go in PR #89
but snuck back in during the rebase as new files with no conflicts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rgarcia added a commit that referenced this pull request Feb 12, 2026
disk_darwin.go and disk_linux.go were unified into disk.go in PR #89
but snuck back in during the rebase as new files with no conflicts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rgarcia added a commit that referenced this pull request Feb 14, 2026
disk_darwin.go and disk_linux.go were unified into disk.go in PR #89
but snuck back in during the rebase as new files with no conflicts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rgarcia added a commit that referenced this pull request Feb 15, 2026
* refactor: cross-platform foundation for macOS support

Split platform-specific code into _linux.go and _darwin.go files across
resources, network, devices, ingress, vmm, and vm_metrics packages.
Add hypervisor abstraction with registration pattern (RegisterSocketName,
RegisterVsockDialerFactory, RegisterClientFactory) to decouple instance
management from specific hypervisor implementations. Add "vz" to the
OpenAPI hypervisor type enum, erofs disk format support, and insecure
registry option for builds.

No behavioral changes on Linux. macOS can now compile but has no VM
functionality yet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add macOS VM support via Apple Virtualization.framework

Add vz hypervisor implementation that runs VMs on macOS using Apple's
Virtualization.framework via a codesigned subprocess (vz-shim). Includes
vsock-based guest communication, shared directory mounts for disk access,
and macOS-native networking via vmnet.

Key components:
- cmd/vz-shim: subprocess that creates and manages vz VMs
- lib/hypervisor/vz: starter, client, and vsock dialer for vz
- Makefile targets: build-darwin, test-darwin, dev-darwin, sign-darwin
- CI: macOS runner for test-darwin
- scripts/install.sh: macOS support (launchd, Homebrew, codesign)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: embed vz.entitlements and fix macOS runtime issues

- Embed vz.entitlements as a Go resource and write it to a temp file at
  runtime for codesigning, replacing the broken entitlementsPath() that
  looked for the file next to the executable
- Add vz-shim copy step in .air.darwin.toml so the go:embed directive
  can find the binary during dev builds
- Add --entitlements flag to codesign in install.sh download path so
  binaries receive the virtualization entitlement
- Prepend /opt/homebrew/opt/e2fsprogs/sbin to launchd plist PATH so
  mkfs.ext4 from keg-only e2fsprogs is found at runtime

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove stale disk platform files from rebase

disk_darwin.go and disk_linux.go were unified into disk.go in PR #89
but snuck back in during the rebase as new files with no conflicts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: vsock proxy data loss, zombie reaping, and remove vz-shim from install

- Read from bufio.Reader instead of raw conn in vsock proxy to prevent
  silent data loss when the buffered reader consumed beyond the newline
- Replace cmd.Process.Release() with go cmd.Wait() to properly reap
  vz-shim child processes instead of leaving zombies
- Update hypervisor README to reflect vz subprocess model (not in-process)
- Remove vz-shim from install/uninstall scripts (it's embedded in
  hypeman-api and extracted at runtime)
- Add CLI smoke tests (hypeman ps, hypeman images) to e2e install test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
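
The data-loss bug described in the first bullet is a common bufio pitfall, sketched below under assumed names. forwardAfterHandshake and runDemo are hypothetical, and the handshake text is invented for the example. The point is only that once a bufio.Reader has read a line, any bytes it buffered past the newline are invisible to reads on the underlying connection, so the forwarding step has to read from the bufio.Reader.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// forwardAfterHandshake reads one newline-terminated line from src and then
// forwards everything that follows to dst. It returns the handshake line.
func forwardAfterHandshake(src io.Reader, dst io.Writer) (string, error) {
	br := bufio.NewReader(src)
	line, err := br.ReadString('\n')
	if err != nil {
		return "", err
	}
	// Copy from br, not src: br may already hold payload bytes that arrived
	// in the same read as the handshake line.
	if _, err := io.Copy(dst, br); err != nil {
		return "", err
	}
	return strings.TrimSuffix(line, "\n"), nil
}

// runDemo feeds a handshake and payload in a single stream and returns what
// was parsed and what was forwarded.
func runDemo() (string, string, error) {
	src := strings.NewReader("CONNECT 1024\npayload-bytes")
	var out bytes.Buffer
	handshake, err := forwardAfterHandshake(src, &out)
	return handshake, out.String(), err
}

func main() {
	handshake, payload, err := runDemo()
	fmt.Printf("handshake=%q payload=%q err=%v\n", handshake, payload, err)
}
```

Replacing `io.Copy(dst, br)` with `io.Copy(dst, src)` in this sketch would silently drop the payload, because the bufio.Reader has already consumed it from the source.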

* fix: e2e test config sourcing and missing CLI handling

- Extract JWT_SECRET/PORT with grep instead of sourcing the config file,
  which breaks on macOS where paths contain spaces
- Skip CLI smoke tests gracefully when CLI binary is not installed
  (e.g., no darwin/arm64 release available)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove unused registry-push flag from gen-jwt

Builder images are now auto-built on startup, so manual push workflow
and the -registry-push flag are no longer needed. The underlying
repo_access JWT infrastructure remains for other registry auth flows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove unnecessary .gitkeep for vz-shim embed dir

The vz-shim embed is darwin-only (build tag), so the directory isn't
needed on Linux. On macOS the Makefile creates it before compiling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add VM lifecycle smoke test to e2e install test

Tests pull, run, exec, stop, and rm using the CLI against a real
alpine VM to verify the full stack works end-to-end after install.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: create vz-shim embed directory in Makefile before copying

The .gitkeep was removed so the directory no longer exists in the repo.
The Makefile needs to mkdir -p before copying the built binary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix macOS CLI install missing $SUDO prefix

The macOS CLI install on line 653 used bare 'install' while all other
binary installs to $INSTALL_DIR used '$SUDO install'. When /usr/local/bin
isn't writable and $SUDO is set to 'sudo', this caused a permission error
that aborted the script (due to set -e) after the service was already
running, leaving a partial installation.

Applied via @cursor push command

* fix: CLI install on macOS and make CLI a hard requirement in e2e

CLI releases use goreleaser naming ("macos" not "darwin", .zip not
.tar.gz). Fix artifact lookup and extraction to handle both formats.

Make CLI presence a hard fail in e2e test — if the install script
can't install the CLI, that's a real failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove nonexistent 'hypeman images' from e2e test

The CLI doesn't have an 'images' subcommand. The VM lifecycle tests
(pull, run, exec, stop, rm) cover real functionality.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: retry hypeman run after async pull in e2e test

Image pulls are async — 'hypeman pull' returns immediately with
status:pending. Retry 'hypeman run' in a loop until the image
is available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
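
The e2e test itself is a shell script, but the retry-until-ready idea can be sketched in Go as below. The retry helper, errNotReady, and the simulated "ready on the third attempt" operation are all hypothetical. They stand in for repeatedly invoking 'hypeman run' until the asynchronous pull has finished.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady simulates the failure seen while the image pull is pending.
var errNotReady = errors.New("image not ready")

// retry calls op up to attempts times, sleeping delay between failures, and
// returns nil on the first success or a wrapped error after the last failure.
func retry(attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(10, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotReady // pull still pending
		}
		return nil // image available, run succeeds
	})
	fmt.Println(calls, err)
}
```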

* fix: clean up macOS development docs and Makefile comment

- Remove "Alternative Commands" section (make dev covers it)
- Remove known limitations that are implementation details or wrong:
  disk format is handled automatically, snapshots aren't supported,
  network ingress is internal, vz-shim is a subprocess not in-process
- Keep disk format and snapshots as brief notes
- Makefile: 'run' target comment says "for agents" not "for testing"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: README treats Linux and macOS as equal platforms

- Requirements: remove "Production"/"Experimental" labels
- Quick Start: "Linux and macOS supported"
- CLI section: reword for local-first usage, remove "remote" framing
- Remove entire "macOS Support" section (platform details belong in
  DEVELOPMENT.md, not the user-facing README)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address PR review: fix shutdown semantics, reduce timeout, add vz integration tests

Changes based on PR review feedback:
- Reduce vz HTTP client timeout from 30s to 10s (local Unix socket)
- Add comment on 2GB memory safety default in vz-shim
- Fix graceful shutdown to only send ACPI power button without immediate
  force-kill fallback, aligning with CH/QEMU semantics
- Add macOS vz integration tests (TestVZBasicLifecycle, TestVZExecAndShutdown)

Test infrastructure improvements:
- Use short /tmp/ paths for vz test temp dirs to avoid macOS 104-byte
  Unix socket path limit (t.TempDir() paths are too long)
- Capture vz-shim stderr and log file contents in error messages for
  better diagnostics when shim fails to start

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
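
To make the socket path constraint concrete, the sketch below checks a path against the 104-byte limit and builds a short path under /tmp. The helper names socketPathFits and shortSocketPath and the "vz-" directory prefix are hypothetical, and the check leaves one byte for the terminating NUL, which is an assumption about how the limit is applied.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// darwinSunPathLen is the macOS Unix socket path limit cited in the commit.
const darwinSunPathLen = 104

// socketPathFits reports whether path fits, leaving room for a NUL byte.
func socketPathFits(path string) bool {
	return len(path) < darwinSunPathLen
}

// shortSocketPath creates a short temp directory under /tmp and returns a
// socket path inside it plus a cleanup function.
func shortSocketPath(name string) (string, func(), error) {
	dir, err := os.MkdirTemp("/tmp", "vz-")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(dir) }
	p := filepath.Join(dir, name)
	if !socketPathFits(p) {
		cleanup()
		return "", nil, fmt.Errorf("socket path %q is %d bytes, limit is %d", p, len(p), darwinSunPathLen)
	}
	return p, cleanup, nil
}

func main() {
	p, cleanup, err := shortSocketPath("vz.sock")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer cleanup()
	fmt.Println(p, len(p), socketPathFits(p))
}
```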

* fix(vz): fix guest-agent exec format error on instance restart

After a force-kill (vm.Stop), the overlay filesystem could have a
corrupted guest-agent binary. The lazy copy optimization skipped
re-copying the binary if it already existed, causing exec format
error on restart. Always copy from initrd to ensure correctness.

Also adds restart coverage to TestVZBasicLifecycle (stop → start →
exec → verify) with diagnostic log dumping on failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address hiroTamada review nits

- Warn on codesign failure instead of silently swallowing (install.sh)
- Fix vz control interface description: HTTP, not gRPC (README.md)
- Remove dead if/else that set same path on both branches (e2e-install-test.sh)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: respect NetworkEnabled=false in vz shim by not creating NIC when networks is empty

When NetworkEnabled=false, the instance's Networks slice is intentionally
empty. The vz shim was incorrectly treating an empty networks slice as
'add default NAT NIC', which gave the guest network access even when
the caller explicitly disabled networking.

Now, when networks is empty, configureNetwork returns immediately without
attaching any NIC, matching the behavior of QEMU and Cloud Hypervisor.

Applied via @cursor push command
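
The early-return behaviour described above could look something like this sketch. Only the function name configureNetwork comes from the commit message. The NetworkSpec and vmConfig types, and the string used to represent an attached NIC, are placeholders rather than the vz-shim's actual types.

```go
package main

import "fmt"

// NetworkSpec is a hypothetical description of one requested network.
type NetworkSpec struct{ MAC string }

// vmConfig is a hypothetical stand-in for the VM configuration being built.
type vmConfig struct{ nics []string }

// configureNetwork attaches one NIC per requested network. An empty slice
// means networking was disabled, so no NIC is attached at all.
func configureNetwork(cfg *vmConfig, networks []NetworkSpec) {
	if len(networks) == 0 {
		return // previously this case added a default NAT NIC
	}
	for _, n := range networks {
		cfg.nics = append(cfg.nics, "nat:"+n.MAC)
	}
}

func main() {
	var disabled, enabled vmConfig
	configureNetwork(&disabled, nil)
	configureNetwork(&enabled, []NetworkSpec{{MAC: "02:00:00:00:00:01"}})
	fmt.Println(len(disabled.nics), len(enabled.nics))
}
```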

* Fix StartInstance call to match new signature

StartInstance now takes a StartInstanceRequest parameter (from PR #99).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix vz tests: keep VM alive with sleep infinity

After PR #99, init does reboot(POWER_OFF) when the entrypoint exits.
Alpine's default entrypoint (/bin/sh) exits immediately with no stdin,
killing the VM before exec tests can run. Add Cmd: sleep infinity to
keep the VM alive, matching the pattern in volumes_test.go.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix e2e test: use nginx:alpine instead of alpine:latest

After PR #99, init does reboot(POWER_OFF) when the entrypoint exits.
Alpine's /bin/sh exits immediately with no stdin, killing the VM before
exec can run. nginx:alpine has a long-running daemon entrypoint that
keeps the VM alive, matching the pattern in exec_test.go.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix e2e clean slate: remove data dir from previous failed runs

Phase 1 was calling uninstall.sh without KEEP_DATA=false, so the data
directory (including stale VMs from previous failed runs) persisted.
This caused name_conflict errors when the test tried to create
e2e-test-vm again.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>